Introduction and Objectives

Every night, hundreds of thousands of tourists opt to pay for and stay in accommodations provided by strangers through the Airbnb website instead of booking traditional lodging such as hotels. Since its inception in 2008, Airbnb has offered an online platform where individuals can rent various types of properties, including rooms, apartments, houses, and occasionally more unique accommodations. Over the years, Airbnb has experienced rapid and extensive growth, making it possible for anyone to find and rent a place virtually anywhere in the world.

This report focuses on Paris, the capital of France, aiming to analyze general trends regarding the prices set by hosts in the city. Our analysis is structured around four main objectives. Firstly, we aim to identify the relationship between prices and apartment features, with a specific emphasis on understanding how various factors such as size, amenities, and location influence rental rates. Secondly, we will delve into the habits of Parisian hosts, seeking to determine the typical number of apartments each owner offers for rent, providing insights into the scale of their operations. Thirdly, we will adopt a geographical approach to assess the renting prices per city quarter, known as “arrondissements,” examining how different areas within Paris correlate with varying price ranges. Finally, we will longitudinally examine the visit frequency of the different quarters over time, providing insights into the popularity and demand dynamics of various neighborhoods in Paris among Airbnb users.

Folders and Files

The exercise comprises the following folders and files:

  - The app.R R script, which contains the Shiny web application, including both the server and the user interface.
  - The data provided for this exercise, stored in an .RData file named AirBnB.RData. This file contains data on Airbnb listings in Paris.

Exercise Tasks

For this exercise, the objective is to explore and analyze the Paris dataset by creating a Shiny application. The application should include the following functionalities:

  1. Relationship between Prices and Apartment Features: Analyze the relationship between rental prices and various apartment features such as the number of bedrooms, bathrooms, beds, and capacity to accommodate guests. Visualize this relationship through interactive plots or charts.

  2. Number of Apartments per Owner: Calculate and display the number of apartments owned by each host. This analysis provides insights into the distribution of listings among different property owners.

  3. Renting Price per City Quarter (“Arrondissements”): Explore the renting prices across different city quarters (arrondissements) in Paris. Analyze the variation in prices and identify areas with higher or lower rental rates. Visualize this information using interactive maps or charts.

  4. Visit Frequency of Different Quarters According to Time: Determine the frequency of visits to different city quarters over time. Analyze trends in visitor activity and identify popular quarters during specific periods. Visualize visit frequency using different plots.

The Shiny application should provide an intuitive and user-friendly interface for users to interact with the data and explore various insights related to Airbnb listings in Paris.

Approach

In this analysis, we considered several key features present in the dataset to gain insights into the Airbnb listings. The features investigated are as follows:

By analyzing these features, we aimed to uncover patterns, trends, and relationships within the dataset, providing valuable insights into the Airbnb market in the study area. The findings from this analysis can inform various stakeholders, including hosts, guests, and policymakers, in making informed decisions related to Airbnb accommodations.

Software and packages

library(DataExplorer)
library(skimr)
library(tidyr)
library(shiny)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggpubr)
library(writexl)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggmap':
## 
##     wind
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(lubridate)
library(leaflet)
library(corrplot)
## corrplot 0.92 loaded
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(here)
## here() starts at D:/Archana DSTI/Big Data Processing with R
library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Preprocessing the dataset

First, load the dataset:

org_data <- load("D:/Archana DSTI/Big Data Processing with R/AirBnB.Rdata")               
org_data
## [1] "L" "R"

Two objects are loaded, named L and R.
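As a side note, load() restores the saved objects into an environment and returns only their names, which is why org_data holds "L" and "R" rather than the data itself. The following minimal sketch illustrates this with toy stand-ins (L_demo, R_demo and the temporary file are hypothetical and not part of the project):

```r
# Toy stand-ins for the two saved objects; the real L and R are far larger.
L_demo <- data.frame(id = 1:3)
R_demo <- data.frame(listing_id = c(1, 1, 2))

# Save both objects to a temporary .RData file (hypothetical path).
tmp_file <- tempfile(fileext = ".RData")
save(L_demo, R_demo, file = tmp_file)

# Load into a dedicated environment so the global workspace is left untouched.
demo_env <- new.env()
loaded_names <- load(tmp_file, envir = demo_env)

loaded_names             # character vector with the names of the restored objects
class(demo_env$L_demo)   # the restored object itself lives in the environment
```

Loading into a separate environment is optional; it simply avoids overwriting objects that already exist under the same names.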

Running View(L) and View(R) opens both data frames in the data viewer, which makes it easier to browse their contents and get a first sense of what they contain.

View(L)
View(R)

We observe the following:

L will be utilized for analyzing features, while R will be employed to compute the visit frequency of different quarters over time.

Generate a summary of the dataset L using the skim() function

skim(L)
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There were 39 warnings in `dplyr::summarize()`.
## The first warning was:
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
##   mangled_skimmers$funs)`.
## Caused by warning in `sorted_count()`:
## ! Variable contains value(s) of "" that have been converted to "empty".
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 38 remaining warnings.
Data summary
Name L
Number of rows 52725
Number of columns 95
_______________________
Column type frequency:
factor 64
logical 2
numeric 29
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
listing_url 0 1 FALSE 52725 htt: 1, htt: 1, htt: 1, htt: 1
last_scraped 0 1 FALSE 2 201: 28982, 201: 23743
name 0 1 FALSE 50132 Cha: 39, App: 25, Cos: 23, Stu: 23
summary 0 1 FALSE 49385 emp: 2743, Mon: 45, Mon: 20, Mon: 17
space 1 1 FALSE 38156 emp: 14253, The: 12, La : 8, The: 6
description 0 1 FALSE 52447 Mon: 31, Mon: 15, Hel: 14, Mon: 12
experiences_offered 0 1 FALSE 1 non: 52725
neighborhood_overview 1 1 FALSE 31248 emp: 20496, Le : 36, The: 13, The: 12
notes 3 1 FALSE 15719 emp: 35361, If : 79, Les: 69, Wit: 47
transit 1 1 FALSE 33294 emp: 18546, Pub: 16, DIR: 12, Sub: 12
access 1 1 FALSE 24627 emp: 24663, Log: 118, Tou: 64, The: 48
interaction 1 1 FALSE 23646 emp: 26874, A f: 75, Non: 64, Nou: 52
house_rules 1 1 FALSE 27163 emp: 22345, .: 126, Reg: 98, Dép: 70
thumbnail_url 0 1 FALSE 39257 emp: 13465, htt: 2, htt: 2, htt: 2
medium_url 0 1 FALSE 39257 emp: 13465, htt: 2, htt: 2, htt: 2
picture_url 0 1 FALSE 52719 htt: 2, htt: 2, htt: 2, htt: 2
xl_picture_url 0 1 FALSE 39257 emp: 13465, htt: 2, htt: 2, htt: 2
host_url 0 1 FALSE 44874 htt: 155, htt: 139, htt: 91, htt: 80
host_name 0 1 FALSE 9344 Mar: 583, Nic: 436, Pie: 418, Car: 388
host_since 0 1 FALSE 2306 201: 166, 201: 165, 201: 155, 201: 135
host_location 0 1 FALSE 1560 Par: 40856, FR: 5463, US: 609, Par: 522
host_about 5 1 FALSE 23867 emp: 21939, We : 155, Nou: 139, .: 124
host_response_time 0 1 FALSE 6 wit: 15039, wit: 13926, N/A: 12517, wit: 10201
host_response_rate 0 1 FALSE 87 100: 26619, N/A: 12517, 90%: 2524, 80%: 1567
host_acceptance_rate 0 1 FALSE 96 100: 19680, N/A: 15591, 0%: 1377, 50%: 1292
host_is_superhost 0 1 FALSE 3 f: 50513, t: 2166, emp: 46
host_thumbnail_url 0 1 FALSE 44652 htt: 192, htt: 155, htt: 139, htt: 91
host_picture_url 0 1 FALSE 44652 htt: 192, htt: 155, htt: 139, htt: 91
host_neighbourhood 0 1 FALSE 231 emp: 6541, Mon: 2968, Rép: 2271, But: 2140
host_verifications 0 1 FALSE 136 [’e: 19488, [’e: 14829, [’e: 4194, [’e: 4085
host_has_profile_pic 0 1 FALSE 3 t: 52487, f: 192, emp: 46
host_identity_verified 0 1 FALSE 3 t: 26949, f: 25730, emp: 46
street 0 1 FALSE 8531 Par: 308, Bou: 209, Rue: 202, Rue: 202
neighbourhood 0 1 FALSE 64 emp: 7457, Mon: 2878, Rép: 2315, But: 2174
neighbourhood_cleansed 0 1 FALSE 20 But: 6025, Pop: 4883, Vau: 3878, Bat: 3603
city 0 1 FALSE 136 Par: 50825, Par: 115, Par: 106, Par: 87
state 0 1 FALSE 53 Île: 50841, IDF: 1355, Ile: 271, emp: 72
zipcode 0 1 FALSE 79 750: 5973, 750: 4825, 750: 3799, 750: 3511
market 0 1 FALSE 30 Par: 49392, emp: 3275, Oth: 15, Dal: 5
smart_location 0 1 FALSE 137 Par: 50824, Par: 115, Par: 106, Par: 87
country_code 0 1 FALSE 2 FR: 52724, CH: 1
country 0 1 FALSE 2 Fra: 52724, Swi: 1
is_location_exact 0 1 FALSE 2 t: 45356, f: 7369
property_type 0 1 FALSE 20 Apa: 50663, Lof: 567, Hou: 537, Bed: 394
room_type 0 1 FALSE 3 Ent: 45177, Pri: 7001, Sha: 547
bed_type 0 1 FALSE 5 Rea: 45993, Pul: 5066, Cou: 1182, Fut: 449
amenities 0 1 FALSE 37737 {}: 552, {TV: 95, {In: 90, {In: 68
price 0 1 FALSE 498 $60: 3055, $50: 3047, $70: 2787, $80: 2598
weekly_price 0 1 FALSE 1186 emp: 30034, $50: 1378, $40: 1291, $45: 1083
monthly_price 0 1 FALSE 1473 emp: 37531, $1,: 769, $1,: 694, $2,: 637
security_deposit 0 1 FALSE 304 emp: 20321, $30: 5421, $50: 5179, $20: 5040
cleaning_fee 0 1 FALSE 157 emp: 20122, $30: 4904, $20: 4879, $50: 3281
extra_people 0 1 FALSE 89 $0.: 37324, $10: 4453, $20: 2653, $15: 2469
calendar_updated 0 1 FALSE 61 tod: 7594, 2 w: 5237, a w: 4351, 3 w: 3499
calendar_last_scraped 0 1 FALSE 2 201: 30064, 201: 22661
first_review 0 1 FALSE 1946 emp: 14508, 201: 212, 201: 193, 201: 186
last_review 0 1 FALSE 1046 emp: 14509, 201: 1327, 201: 1202, 201: 1116
requires_license 0 1 FALSE 1 f: 52725
license 0 1 FALSE 2 emp: 52724, AJO: 1
jurisdiction_names 0 1 FALSE 2 Par: 51726, emp: 999
instant_bookable 0 1 FALSE 2 f: 44186, t: 8539
cancellation_policy 0 1 FALSE 5 fle: 19244, str: 18427, mod: 15039, sup: 9
require_guest_profile_picture 0 1 FALSE 2 f: 51816, t: 909
require_guest_phone_verification 0 1 FALSE 2 f: 51014, t: 1711

Variable type: logical

skim_variable n_missing complete_rate mean count
neighbourhood_group_cleansed 52725 0 NaN :
has_availability 52725 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 7.069608e+06 4180018.28 2.62300e+03 3.470301e+06 6.965852e+06 1.074006e+07 1.381956e+07 ▇▆▇▅▇
scrape_id 0 1.00 2.016070e+13 0.00 2.01607e+13 2.016070e+13 2.016070e+13 2.016070e+13 2.016070e+13 ▁▁▇▁▁
host_id 0 1.00 2.248560e+07 20345155.79 2.62600e+03 6.158190e+06 1.588541e+07 3.434872e+07 8.139705e+07 ▇▃▂▁▁
host_listings_count 46 1.00 5.830000e+00 28.97 0.00000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.024000e+03 ▇▁▁▁▁
host_total_listings_count 46 1.00 5.830000e+00 28.97 0.00000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.024000e+03 ▇▁▁▁▁
latitude 0 1.00 4.886000e+01 0.02 4.88100e+01 4.885000e+01 4.886000e+01 4.888000e+01 4.891000e+01 ▁▅▇▇▃
longitude 0 1.00 2.340000e+00 0.03 2.22000e+00 2.320000e+00 2.350000e+00 2.370000e+00 2.470000e+00 ▁▃▇▃▁
accommodates 0 1.00 3.050000e+00 1.46 1.00000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▁▁▁▁
bathrooms 243 1.00 1.090000e+00 0.38 0.00000e+00 1.000000e+00 1.000000e+00 1.000000e+00 8.000000e+00 ▇▁▁▁▁
bedrooms 193 1.00 1.060000e+00 0.79 0.00000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+01 ▇▁▁▁▁
beds 80 1.00 1.680000e+00 1.05 0.00000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
square_feet 50218 0.05 3.679900e+02 485.53 0.00000e+00 0.000000e+00 3.230000e+02 5.380000e+02 1.505900e+04 ▇▁▁▁▁
guests_included 0 1.00 1.350000e+00 0.92 0.00000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 3.130000e+00 8.01 1.00000e+00 1.000000e+00 2.000000e+00 3.000000e+00 1.000000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 1.252547e+05 16204416.39 1.00000e+00 6.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
availability_30 0 1.00 1.165000e+01 11.26 0.00000e+00 0.000000e+00 8.000000e+00 2.300000e+01 3.000000e+01 ▇▂▂▂▃
availability_60 0 1.00 2.733000e+01 22.49 0.00000e+00 2.000000e+00 2.600000e+01 5.000000e+01 6.000000e+01 ▇▃▃▂▆
availability_90 0 1.00 4.118000e+01 33.56 0.00000e+00 6.000000e+00 3.700000e+01 7.500000e+01 9.000000e+01 ▇▃▂▃▆
availability_365 0 1.00 1.794600e+02 146.77 0.00000e+00 2.200000e+01 1.830000e+02 3.360000e+02 3.650000e+02 ▇▂▂▂▇
number_of_reviews 0 1.00 1.259000e+01 25.21 0.00000e+00 0.000000e+00 3.000000e+00 1.300000e+01 3.920000e+02 ▇▁▁▁▁
review_scores_rating 15454 0.71 9.101000e+01 8.82 2.00000e+01 8.700000e+01 9.300000e+01 9.700000e+01 1.000000e+02 ▁▁▁▂▇
review_scores_accuracy 15575 0.70 9.410000e+00 0.87 2.00000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_cleanliness 15566 0.70 9.110000e+00 1.13 2.00000e+00 9.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 ▁▁▁▂▇
review_scores_checkin 15579 0.70 9.600000e+00 0.76 2.00000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_communication 15543 0.71 9.650000e+00 0.74 2.00000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_location 15560 0.70 9.440000e+00 0.82 2.00000e+00 9.000000e+00 1.000000e+01 1.000000e+01 1.000000e+01 ▁▁▁▁▇
review_scores_value 15559 0.70 9.180000e+00 0.91 2.00000e+00 9.000000e+00 9.000000e+00 1.000000e+01 1.000000e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 4.090000e+00 14.23 1.00000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.550000e+02 ▇▁▁▁▁
reviews_per_month 14508 0.72 1.340000e+00 1.39 1.00000e-02 3.600000e-01 9.000000e-01 1.870000e+00 1.429000e+01 ▇▁▁▁▁

The summary also gives a glimpse of the missing values, the number of unique values, and so on.

Data Processing

We begin by listing all the columns from L dataset.

colnames(L)
##  [1] "id"                               "listing_url"                     
##  [3] "scrape_id"                        "last_scraped"                    
##  [5] "name"                             "summary"                         
##  [7] "space"                            "description"                     
##  [9] "experiences_offered"              "neighborhood_overview"           
## [11] "notes"                            "transit"                         
## [13] "access"                           "interaction"                     
## [15] "house_rules"                      "thumbnail_url"                   
## [17] "medium_url"                       "picture_url"                     
## [19] "xl_picture_url"                   "host_id"                         
## [21] "host_url"                         "host_name"                       
## [23] "host_since"                       "host_location"                   
## [25] "host_about"                       "host_response_time"              
## [27] "host_response_rate"               "host_acceptance_rate"            
## [29] "host_is_superhost"                "host_thumbnail_url"              
## [31] "host_picture_url"                 "host_neighbourhood"              
## [33] "host_listings_count"              "host_total_listings_count"       
## [35] "host_verifications"               "host_has_profile_pic"            
## [37] "host_identity_verified"           "street"                          
## [39] "neighbourhood"                    "neighbourhood_cleansed"          
## [41] "neighbourhood_group_cleansed"     "city"                            
## [43] "state"                            "zipcode"                         
## [45] "market"                           "smart_location"                  
## [47] "country_code"                     "country"                         
## [49] "latitude"                         "longitude"                       
## [51] "is_location_exact"                "property_type"                   
## [53] "room_type"                        "accommodates"                    
## [55] "bathrooms"                        "bedrooms"                        
## [57] "beds"                             "bed_type"                        
## [59] "amenities"                        "square_feet"                     
## [61] "price"                            "weekly_price"                    
## [63] "monthly_price"                    "security_deposit"                
## [65] "cleaning_fee"                     "guests_included"                 
## [67] "extra_people"                     "minimum_nights"                  
## [69] "maximum_nights"                   "calendar_updated"                
## [71] "has_availability"                 "availability_30"                 
## [73] "availability_60"                  "availability_90"                 
## [75] "availability_365"                 "calendar_last_scraped"           
## [77] "number_of_reviews"                "first_review"                    
## [79] "last_review"                      "review_scores_rating"            
## [81] "review_scores_accuracy"           "review_scores_cleanliness"       
## [83] "review_scores_checkin"            "review_scores_communication"     
## [85] "review_scores_location"           "review_scores_value"             
## [87] "requires_license"                 "license"                         
## [89] "jurisdiction_names"               "instant_bookable"                
## [91] "cancellation_policy"              "require_guest_profile_picture"   
## [93] "require_guest_phone_verification" "calculated_host_listings_count"  
## [95] "reviews_per_month"

To preserve the original dataset, we create a new one, called New_data, that keeps only the relevant columns.

Using the select() function, we build a subset of the L dataset containing only the variables (out of the 95) that are useful for the project, renaming several of them along the way:

New_data <- select(L, listing_id =id, Host_id= host_id, Host_name= host_name, bathrooms, bedrooms, beds, bed_type, Equipments= amenities, Property_type= property_type, Room_type= room_type, Nb_of_guests= accommodates,Price= price, guests_included, minimum_nights, maximum_nights,availability_over_one_year= availability_365, instant_bookable, cancellation_policy, city, Adresse= street, Neighbourhood=neighbourhood_cleansed, city_quarter=zipcode, latitude, longitude, security_deposit, transit, host_response_time, Superhost= host_is_superhost, Host_since= host_since, Listing_count= calculated_host_listings_count, Host_score= review_scores_rating, reviews_per_month,number_of_reviews,square_feet)

Retrieve the column names of the New_data dataframe

colnames(New_data)
##  [1] "listing_id"                 "Host_id"                   
##  [3] "Host_name"                  "bathrooms"                 
##  [5] "bedrooms"                   "beds"                      
##  [7] "bed_type"                   "Equipments"                
##  [9] "Property_type"              "Room_type"                 
## [11] "Nb_of_guests"               "Price"                     
## [13] "guests_included"            "minimum_nights"            
## [15] "maximum_nights"             "availability_over_one_year"
## [17] "instant_bookable"           "cancellation_policy"       
## [19] "city"                       "Adresse"                   
## [21] "Neighbourhood"              "city_quarter"              
## [23] "latitude"                   "longitude"                 
## [25] "security_deposit"           "transit"                   
## [27] "host_response_time"         "Superhost"                 
## [29] "Host_since"                 "Listing_count"             
## [31] "Host_score"                 "reviews_per_month"         
## [33] "number_of_reviews"          "square_feet"

Remove duplicate entries from the dataset, keeping the first row for each listing_id:

New_data <- New_data %>% distinct(listing_id, .keep_all = TRUE)
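To make the effect of distinct(listing_id, .keep_all = TRUE) concrete, here is a small base R sketch of the same idea: keep the first row seen for each identifier and retain all other columns. The listings data frame and its values are made up for illustration.

```r
# Toy data: listing 102 appears twice (hypothetical values).
listings <- data.frame(
  listing_id = c(101, 102, 102, 103),
  Price      = c("$60.00", "$80.00", "$85.00", "$50.00")
)

# duplicated() flags the second and later occurrences of each listing_id,
# so negating it keeps the first row per listing along with all its columns.
deduped <- listings[!duplicated(listings$listing_id), ]

nrow(listings) - nrow(deduped)   # number of duplicate rows dropped
deduped
```

Comparing the row count before and after is a quick way to check whether the dataset contained any duplicates at all.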

To manipulate the prices as numbers, the Price column must have the appropriate data type. It is stored as a factor of strings with a leading $ sign and commas as thousands separators, so both are stripped first:

# Removing the thousands separators (",") and the leading "$" character
New_data$Price <- substring(gsub(",", "", as.character(New_data$Price)),2)

Let’s take a glimpse at ‘Price’ column in the New_data dataframe to verify that the $ symbol is removed

glimpse(New_data[,"Price"])
##  chr [1:52725] "60.00" "200.00" "80.00" "60.00" "50.00" "191.00" "100.00" ...
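The cleaning step above relies on the $ sign always being the first character. A slightly more defensive variant, sketched here on made-up values, removes "$" and "," wherever they occur and converts to numeric in one go; the raw_price vector is hypothetical.

```r
# Toy price strings stored as a factor, as in the original data (hypothetical values).
raw_price <- factor(c("$60.00", "$1,200.00", "$80.00", ""))

# Convert factor -> character first, then drop "$" and "," wherever they appear.
clean_price <- gsub("[$,]", "", as.character(raw_price))

# Convert to numeric; entries that are empty after cleaning become NA.
price_num <- as.numeric(clean_price)
price_num
```

Because it does not depend on character positions, this pattern can be reused for other currency columns such as weekly_price or cleaning_fee.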

Data Type Conversion

Let’s take a look into the data types in the ‘New_data’ dataset.

data_types <- data.frame(Column_Name = names(New_data), Data_Type = sapply(New_data, class))
print(data_types)
##                                           Column_Name Data_Type
## listing_id                                 listing_id   integer
## Host_id                                       Host_id   integer
## Host_name                                   Host_name    factor
## bathrooms                                   bathrooms   numeric
## bedrooms                                     bedrooms   integer
## beds                                             beds   integer
## bed_type                                     bed_type    factor
## Equipments                                 Equipments    factor
## Property_type                           Property_type    factor
## Room_type                                   Room_type    factor
## Nb_of_guests                             Nb_of_guests   integer
## Price                                           Price character
## guests_included                       guests_included   integer
## minimum_nights                         minimum_nights   integer
## maximum_nights                         maximum_nights   integer
## availability_over_one_year availability_over_one_year   integer
## instant_bookable                     instant_bookable    factor
## cancellation_policy               cancellation_policy    factor
## city                                             city    factor
## Adresse                                       Adresse    factor
## Neighbourhood                           Neighbourhood    factor
## city_quarter                             city_quarter    factor
## latitude                                     latitude   numeric
## longitude                                   longitude   numeric
## security_deposit                     security_deposit    factor
## transit                                       transit    factor
## host_response_time                 host_response_time    factor
## Superhost                                   Superhost    factor
## Host_since                                 Host_since    factor
## Listing_count                           Listing_count   integer
## Host_score                                 Host_score   integer
## reviews_per_month                   reviews_per_month   numeric
## number_of_reviews                   number_of_reviews   integer
## square_feet                               square_feet   integer

To ensure that the variables have the appropriate data types, we apply the following conversions:

1. Converting to numeric columns:

# Changing the data type
New_data$bedrooms <- as.numeric(New_data$bedrooms)
New_data$beds <- as.numeric(New_data$beds)
New_data$Price <- as.numeric(New_data$Price)
New_data$guests_included <- as.numeric(New_data$guests_included)
New_data$minimum_nights <- as.numeric(New_data$minimum_nights)
New_data$maximum_nights <- as.numeric(New_data$maximum_nights)
New_data$availability_over_one_year <- as.numeric(New_data$availability_over_one_year)
# Caution: security_deposit is a factor of currency strings (leading "$"), so
# as.numeric() returns its internal level codes here, not the deposit amounts.
New_data$security_deposit <- as.numeric(New_data$security_deposit)
New_data$Listing_count <- as.numeric(New_data$Listing_count)
New_data$Host_score <- as.numeric(New_data$Host_score)
New_data$number_of_reviews <- as.numeric(New_data$number_of_reviews)
New_data$square_feet <- as.numeric(New_data$square_feet)
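One point worth illustrating about these conversions: as.numeric() behaves differently on a factor than on an integer or character vector. On a factor it returns the internal level codes, so a currency column such as security_deposit needs to go through as.character() and have its symbols stripped before conversion. The sketch below uses a made-up deposit factor to show the difference.

```r
# Toy deposit values stored as a factor of currency strings (hypothetical values).
deposit <- factor(c("$300.00", "", "$1,000.00", "$300.00"))

# Pitfall: on a factor, as.numeric() returns the integer level codes.
codes <- as.numeric(deposit)

# Safer: go through character, strip "$" and ",", then convert.
# Empty strings become NA, which makes missing deposits visible.
amounts <- as.numeric(gsub("[$,]", "", as.character(deposit)))

codes
amounts
```

With the second approach, listings without a deposit show up as missing values instead of being silently assigned a code.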

2. Converting to character columns:

New_data$Neighbourhood <- as.character(New_data$Neighbourhood)

3. Converting to date columns

New_data$Host_since <- as.Date(New_data$Host_since)
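As an optional variant, the date conversion can be made explicit about the expected format, so that entries which do not match (for example empty strings) become NA rather than depending on format guessing. This is a sketch on made-up values and assumes the dates are written as year-month-day.

```r
# Toy host_since values stored as a factor; the empty string mimics a missing date.
host_since <- factor(c("2012-05-14", "", "2015-11-02"))

# Convert via character with an explicit format (assumed to be ISO year-month-day).
parsed <- as.Date(as.character(host_since), format = "%Y-%m-%d")

parsed
sum(is.na(parsed))   # number of entries that could not be parsed
```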

Finally, let’s ensure the data types are updated.

data_types <- data.frame(Column_Name = names(New_data), Data_Type = sapply(New_data, class))
print(data_types)
##                                           Column_Name Data_Type
## listing_id                                 listing_id   integer
## Host_id                                       Host_id   integer
## Host_name                                   Host_name    factor
## bathrooms                                   bathrooms   numeric
## bedrooms                                     bedrooms   numeric
## beds                                             beds   numeric
## bed_type                                     bed_type    factor
## Equipments                                 Equipments    factor
## Property_type                           Property_type    factor
## Room_type                                   Room_type    factor
## Nb_of_guests                             Nb_of_guests   integer
## Price                                           Price   numeric
## guests_included                       guests_included   numeric
## minimum_nights                         minimum_nights   numeric
## maximum_nights                         maximum_nights   numeric
## availability_over_one_year availability_over_one_year   numeric
## instant_bookable                     instant_bookable    factor
## cancellation_policy               cancellation_policy    factor
## city                                             city    factor
## Adresse                                       Adresse    factor
## Neighbourhood                           Neighbourhood character
## city_quarter                             city_quarter    factor
## latitude                                     latitude   numeric
## longitude                                   longitude   numeric
## security_deposit                     security_deposit   numeric
## transit                                       transit    factor
## host_response_time                 host_response_time    factor
## Superhost                                   Superhost    factor
## Host_since                                 Host_since      Date
## Listing_count                           Listing_count   numeric
## Host_score                                 Host_score   numeric
## reviews_per_month                   reviews_per_month   numeric
## number_of_reviews                   number_of_reviews   numeric
## square_feet                               square_feet   numeric

Removing Outliers

summary(New_data$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   55.00   75.00   96.51  110.00 6081.00

A quick computation with the R summary() function, as done above, shows that the minimum price is $0 and the maximum is $6081. Although human kindness is limitless, free rentals do not exist on Airbnb. Likewise, spending $6081 to rent a property for one night seems unreasonable. At the time of writing, a quick search for rentals in Paris on the Airbnb website showed prices ranging from around $20 to approximately $1300. Consequently, we use these values as the valid range for the Price variable and remove the listings that fall outside it.

Airbnb web page
# Setting the price range
New_data <- New_data %>%
  filter(Price >= 20, Price <= 1300)
summary(New_data$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   55.00   75.00   95.74  110.00 1285.00

After cleaning up the dataset, the median price is still $75, with a minimum of $20 and a maximum of $1285.
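The range filter can be sketched in base R as follows. The bounds are the ones chosen above, while the prices data frame and the names min_price and max_price are made up for illustration. Note that rows with a missing price are dropped as well, which is also what dplyr's filter() does.

```r
# Toy prices, including the kinds of extremes discussed above and one missing value.
prices <- data.frame(
  listing_id = 1:6,
  Price      = c(0, 20, 75, 1300, 6081, NA)
)

min_price <- 20     # lower bound chosen from the Airbnb website check
max_price <- 1300   # upper bound chosen from the Airbnb website check

# Keep rows whose price is known and lies inside the closed interval.
in_range <- !is.na(prices$Price) &
  prices$Price >= min_price &
  prices$Price <= max_price
kept <- prices[in_range, ]

kept
nrow(prices) - nrow(kept)   # number of listings removed
```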

Quantifying missing values

As a first step, we assess the robustness and relevance of the selected data, notably by counting the missing values in each variable.

# Count missing values for all columns
missing_counts <- colSums(is.na(New_data))

# Print the results
missing_counts
##                 listing_id                    Host_id 
##                          0                          0 
##                  Host_name                  bathrooms 
##                          0                        243 
##                   bedrooms                       beds 
##                        193                         80 
##                   bed_type                 Equipments 
##                          0                          0 
##              Property_type                  Room_type 
##                          0                          0 
##               Nb_of_guests                      Price 
##                          0                          0 
##            guests_included             minimum_nights 
##                          0                          0 
##             maximum_nights availability_over_one_year 
##                          0                          0 
##           instant_bookable        cancellation_policy 
##                          0                          0 
##                       city                    Adresse 
##                          0                          0 
##              Neighbourhood               city_quarter 
##                          0                          0 
##                   latitude                  longitude 
##                          0                          0 
##           security_deposit                    transit 
##                          0                          1 
##         host_response_time                  Superhost 
##                          0                          0 
##                 Host_since              Listing_count 
##                         46                          0 
##                 Host_score          reviews_per_month 
##                      15398                      14458 
##          number_of_reviews                square_feet 
##                          0                      50111

Here we find that about 95% of the values in the square_feet variable are missing. A very small proportion of the values in the bathrooms, bedrooms, and beds columns are also missing. This prompts us to drop the square_feet variable from New_data. In contrast, handling the missing values in bathrooms, bedrooms, and beds requires some discussion, since different approaches are possible. First, we could fill the missing values with a representative value. As these variables are numeric, the mean is a natural choice (the median is a common alternative that is less sensitive to extreme values). Alternatively, we could simply remove these rows from the dataset: they are so few that this would not noticeably affect the analysis.
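The per-column counts can be turned into proportions to make statements such as "about 95% missing" directly checkable. The sketch below does this on a made-up data frame; the 50% threshold is an arbitrary, hypothetical cut-off for flagging columns.

```r
# Toy data with one mostly-empty column (hypothetical values).
demo <- data.frame(
  bathrooms   = c(1, NA, 1, 2),
  square_feet = c(NA, NA, NA, 400)
)

# Share of missing values per column: the mean of a logical matrix, by column.
missing_share <- colMeans(is.na(demo))
round(100 * missing_share, 1)   # expressed as percentages

# Columns whose missing share exceeds the chosen threshold.
threshold <- 0.5
names(missing_share)[missing_share > threshold]
```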

Fill in missing values

The approach followed in this case is to fill the missing values with the mean value of the corresponding column (bathrooms, bedrooms and beds):

#Bathrooms
# Calculate the mean value of the "bathrooms" column
mean_value <- mean(New_data$bathrooms, na.rm = TRUE)
# Replace missing values with the mean value
New_data$bathrooms <- na.aggregate(New_data$bathrooms)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled <- mean(New_data$bathrooms)
print(paste("Value with which missing values are filled:", mean_value_filled))
## [1] "Value with which missing values are filled: 1.08914898898287"
#Bedrooms
# Calculate the mean value of the "bedrooms" column
mean_value_bedrooms <- mean(New_data$bedrooms, na.rm = TRUE)
# Replace missing values with the mean value
New_data$bedrooms <- na.aggregate(New_data$bedrooms)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled_bedrooms <- mean(New_data$bedrooms)
print(paste("Value with which missing values are filled", mean_value_filled_bedrooms))
## [1] "Value with which missing values are filled 1.0583904011598"
#Beds
# Calculate the mean value of the "beds" column
mean_value_beds <- mean(New_data$beds, na.rm = TRUE)
# Replace missing values with the mean value
New_data$beds <- na.aggregate(New_data$beds)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled_beds <- mean(New_data$beds)
print(paste("Value with which missing values are filled", mean_value_filled_beds))
## [1] "Value with which missing values are filled 1.68261763362266"
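
The imputed means above are not whole numbers, which is worth keeping in mind for count variables. As a minimal sketch of the alternatives discussed earlier, the snippet below compares mean and median imputation in base R. The vector `bedrooms` and the helper `impute_with` are hypothetical and only serve as an illustration; they are not part of the original analysis.

```r
# Toy data standing in for a count column such as New_data$bedrooms (values are made up)
bedrooms <- c(1, 2, NA, 1, 3, NA, 1)

# Hypothetical helper: replace NAs with a summary statistic of the observed values
impute_with <- function(x, stat = mean) {
  fill_value <- stat(x, na.rm = TRUE)
  x[is.na(x)] <- fill_value
  x
}

mean_filled   <- impute_with(bedrooms, mean)    # NAs become 1.6 (a fractional count)
median_filled <- impute_with(bedrooms, median)  # NAs become 1 (a whole number)
```

The median keeps the filled values on the same integer scale as the observed ones, while the mean preserves the column average.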

In this analysis, we extracted distinct values from the “Neighbourhood” column of the New_data dataframe, checking for any spelling variations or inconsistencies across neighborhoods. The resulting list provides a comprehensive overview of unique neighborhood names within the dataset.

# Get distinct values from the "Neighbourhood" column in the New_data dataframe
distinct_neighbourhoods <- unique(New_data$Neighbourhood)
distinct_neighbourhoods_list <- as.list(distinct_neighbourhoods)
distinct_neighbourhoods_list
## [[1]]
## [1] "Batignolles-Monceau"
## 
## [[2]]
## [1] "Palais-Bourbon"
## 
## [[3]]
## [1] "Buttes-Chaumont"
## 
## [[4]]
## [1] "Opéra"
## 
## [[5]]
## [1] "Entrepôt"
## 
## [[6]]
## [1] "Gobelins"
## 
## [[7]]
## [1] "Vaugirard"
## 
## [[8]]
## [1] "Reuilly"
## 
## [[9]]
## [1] "Louvre"
## 
## [[10]]
## [1] "Luxembourg"
## 
## [[11]]
## [1] "Élysée"
## 
## [[12]]
## [1] "Temple"
## 
## [[13]]
## [1] "Ménilmontant"
## 
## [[14]]
## [1] "Panthéon"
## 
## [[15]]
## [1] "Passy"
## 
## [[16]]
## [1] "Observatoire"
## 
## [[17]]
## [1] "Popincourt"
## 
## [[18]]
## [1] "Bourse"
## 
## [[19]]
## [1] "Buttes-Montmartre"
## 
## [[20]]
## [1] "Hôtel-de-Ville"

City_quarter Column cleaning

# Cleaning the city quarters (arrondissements):
New_data$city <- str_sub(New_data$city, 1, 5)
New_data$city_quarter <- str_sub(New_data$city_quarter, -2)
# Keep Paris listings whose two-character code is a valid arrondissement ("01" to "20");
# %in% avoids comparing character codes against the number 20
New_data <- subset(New_data, city == 'Paris' & city_quarter %in% sprintf("%02d", 1:20))
unique_values <- unique(New_data$city_quarter)

# Printing unique values of the city quarters (arrondissements)
print(unique_values)
##  [1] "17" "08" "18" "13" "16" "09" "10" "07" "15" "06" "19" "01" "20" "11" "04"
## [16] "02" "03" "12" "05" "14"

The subset of the New_data dataset comprises records corresponding to properties located in Paris. Specifically, it includes entries where the city quarter (or arrondissement) information is available and falls within the range of 01 to 20, excluding ‘00’. This filtering ensures that only relevant data related to properties situated in Paris and categorized within valid city quarters (Arrondissements) is retained for further analysis.
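
The validity check on arrondissement codes can be sketched in base R on a few made-up postal codes. The vector `zipcodes` is an assumption used only for illustration; the real column also has to be combined with the check on the city name, since a non-Parisian code can end in a valid-looking pair of digits.

```r
# Made-up postal codes standing in for the raw city_quarter column
zipcodes <- c("75017", "75008", "75000", "75116", "", "92100", NA)

# Keep the last two characters of each code
last_two <- substr(zipcodes, nchar(zipcodes) - 1, nchar(zipcodes))

# Valid arrondissement codes as zero-padded strings: "01", "02", ..., "20"
valid_codes <- sprintf("%02d", 1:20)

# TRUE only for codes ending in a valid arrondissement; "", "00" and NA are rejected
is_valid <- last_two %in% valid_codes
```

Comparing against an explicit set of strings avoids relying on how R orders character values when they are compared with a number.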

The data is now cleaned. Let’s have a look at the first rows of our new dataset, followed by its summary.

head(New_data)
##   listing_id  Host_id Host_name bathrooms bedrooms beds bed_type
## 1    4867396  9703910  Matthieu         1        1    1 Real Bed
## 2    7704653 35777602    Claire         2        2    3 Real Bed
## 3    2725029 13945253   Vincent         1        1    1 Real Bed
## 4    9337509  5107123     Julie         1        1    1 Real Bed
## 5   12928158 51195601   Daniele         1        1    1 Real Bed
## 6    5589471 28980052  Philippe         3        4    4 Real Bed
##                                                                                                                                                      Equipments
## 1                                                                          {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Dryer,Essentials}
## 2                                                       {"Wireless Internet",Kitchen,"Elevator in Building","Buzzer/Wireless Intercom",Washer,Dryer,Essentials}
## 3                                          {TV,Internet,"Wireless Internet",Kitchen,"Indoor Fireplace",Heating,"Family/Kid Friendly",Washer,Essentials,Shampoo}
## 4                                                                                                       {"Wireless Internet",Kitchen,Heating,Washer,Essentials}
## 5 {"Wireless Internet",Kitchen,"Smoking Allowed","Pets Allowed",Breakfast,"Elevator in Building",Heating,"Family/Kid Friendly",Washer,Dryer,Essentials,Shampoo}
## 6                          {TV,Internet,"Wireless Internet",Kitchen,Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Fire Extinguisher",Essentials}
##   Property_type       Room_type Nb_of_guests Price guests_included
## 1     Apartment Entire home/apt            2    60               1
## 2     Apartment Entire home/apt            4   200               1
## 3     Apartment Entire home/apt            2    80               1
## 4     Apartment Entire home/apt            2    60               0
## 5     Apartment    Private room            2    50               1
## 6         House Entire home/apt            6   191               1
##   minimum_nights maximum_nights availability_over_one_year instant_bookable
## 1              1           1125                          0                f
## 2              1           1125                          0                f
## 3              3           1125                        298                f
## 4              2           1125                        364                f
## 5              1             30                         89                f
## 6              3           1125                          0                f
##   cancellation_policy  city
## 1            flexible Paris
## 2            flexible Paris
## 3            flexible Paris
## 4            flexible Paris
## 5            flexible Paris
## 6            flexible Paris
##                                                 Adresse       Neighbourhood
## 1      Rue Legendre, Paris, Île-de-France 75017, France Batignolles-Monceau
## 2  Avenue Mac-Mahon, Paris, Île-de-France 75017, France Batignolles-Monceau
## 3  Rue la Condamine, Paris, Île-de-France 75017, France Batignolles-Monceau
## 4       Rue Gauthey, Paris, Île-de-France 75017, France Batignolles-Monceau
## 5 Avenue Brunetière, Paris, Île-de-France 75017, France Batignolles-Monceau
## 6   Rue de Saussure, Paris, Île-de-France 75017, France Batignolles-Monceau
##   city_quarter latitude longitude security_deposit transit host_response_time
## 1           17 48.88880  2.320466               94                        N/A
## 2           17 48.87664  2.293724                1                        N/A
## 3           17 48.88384  2.321031              208             within an hour
## 4           17 48.89236  2.322338              106               within a day
## 5           17 48.88942  2.298321                1             within an hour
## 6           17 48.88707  2.312212                1                        N/A
##   Superhost Host_since Listing_count Host_score reviews_per_month
## 1         f 2013-10-29             1        100              0.07
## 2         f 2015-06-14             1         NA                NA
## 3         f 2014-04-06             1         80              0.11
## 4         f 2013-02-16             1         80              0.15
## 5         f 2015-12-13             1        100              2.00
## 6         f 2015-03-08             1         NA                NA
##   number_of_reviews square_feet
## 1                 1          NA
## 2                 0          NA
## 3                 1          NA
## 4                 1          NA
## 5                 2          NA
## 6                 0          NA

summary(New_data)
##    listing_id          Host_id            Host_name       bathrooms    
##  Min.   :    2623   Min.   :    2626   Marie   :  564   Min.   :0.000  
##  1st Qu.: 3436213   1st Qu.: 6088109   Nicolas :  427   1st Qu.:1.000  
##  Median : 6920193   Median :15713334   Pierre  :  408   Median :1.000  
##  Mean   : 7007274   Mean   :22236938   Caroline:  380   Mean   :1.089  
##  3rd Qu.:10563073   3rd Qu.:33957264   Anne    :  377   3rd Qu.:1.000  
##  Max.   :13819560   Max.   :81397049   Sophie  :  365   Max.   :8.000  
##                                        (Other) :48792                  
##     bedrooms           beds                 bed_type    
##  Min.   : 0.000   Min.   : 0.000   Airbed       :   27  
##  1st Qu.: 1.000   1st Qu.: 1.000   Couch        : 1159  
##  Median : 1.000   Median : 1.000   Futon        :  433  
##  Mean   : 1.057   Mean   : 1.682   Pull-out Sofa: 4923  
##  3rd Qu.: 1.000   3rd Qu.: 2.000   Real Bed     :44771  
##  Max.   :10.000   Max.   :16.000                        
##                                                         
##                                                                           Equipments   
##  {}                                                                            :  532  
##  {TV,Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials}           :   93  
##  {Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials}              :   90  
##  {Internet,"Wireless Internet",Kitchen,Heating,Essentials}                     :   67  
##  {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials}:   64  
##  {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer}           :   64  
##  (Other)                                                                       :50403  
##          Property_type             Room_type      Nb_of_guests   
##  Apartment      :49355   Entire home/apt:44083   Min.   : 1.000  
##  Loft           :  549   Private room   : 6745   1st Qu.: 2.000  
##  House          :  508   Shared room    :  485   Median : 2.000  
##  Bed & Breakfast:  375                           Mean   : 3.052  
##  Condominium    :  255                           3rd Qu.: 4.000  
##  Other          :  117                           Max.   :16.000  
##  (Other)        :  154                                           
##      Price         guests_included  minimum_nights     maximum_nights     
##  Min.   :  20.00   Min.   : 0.000   Min.   :   1.000   Min.   :1.000e+00  
##  1st Qu.:  55.00   1st Qu.: 1.000   1st Qu.:   1.000   1st Qu.:6.000e+01  
##  Median :  75.00   Median : 1.000   Median :   2.000   Median :1.125e+03  
##  Mean   :  96.14   Mean   : 1.356   Mean   :   3.131   Mean   :1.287e+05  
##  3rd Qu.: 111.00   3rd Qu.: 2.000   3rd Qu.:   3.000   3rd Qu.:1.125e+03  
##  Max.   :1285.00   Max.   :16.000   Max.   :1000.000   Max.   :2.147e+09  
##                                                                           
##  availability_over_one_year instant_bookable      cancellation_policy
##  Min.   :  0                f:43069          flexible       :18526   
##  1st Qu.: 22                t: 8244          moderate       :14720   
##  Median :183                                 strict         :18057   
##  Mean   :180                                 super_strict_30:    5   
##  3rd Qu.:336                                 super_strict_60:    5   
##  Max.   :365                                                         
##                                                                      
##      city          
##  Length:51313      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##                    
##                                                              Adresse     
##  Boulevard Voltaire, Paris, Île-de-France 75011, France          :  209  
##  Rue du Faubourg Saint-Martin, Paris, Île-de-France 75010, France:  202  
##  Rue Oberkampf, Paris, Île-de-France 75011, France               :  201  
##  Rue Saint-Maur, Paris, Île-de-France 75011, France              :  196  
##  Rue de Charenton, Paris, Île-de-France 75012, France            :  188  
##  Rue du Faubourg Saint-Denis, Paris, Île-de-France 75010, France :  174  
##  (Other)                                                         :50143  
##  Neighbourhood      city_quarter          latitude       longitude    
##  Length:51313       Length:51313       Min.   :48.82   Min.   :2.230  
##  Class :character   Class :character   1st Qu.:48.85   1st Qu.:2.323  
##  Mode  :character   Mode  :character   Median :48.86   Median :2.347  
##                                        Mean   :48.86   Mean   :2.344  
##                                        3rd Qu.:48.88   3rd Qu.:2.369  
##                                        Max.   :48.90   Max.   :2.459  
##                                                                       
##  security_deposit
##  Min.   :  1.00  
##  1st Qu.:  1.00  
##  Median : 58.00  
##  Mean   : 81.72  
##  3rd Qu.:129.00  
##  Max.   :304.00  
##                  
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        transit     
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            :17900  
##  Public transportation is a bit of a maze in Paris. I recommend you to book a transfer on the app Bonjour Paris (G00gle or Apple store).                                                                                                                                                                                                                                                                                                                                                   :   16  
##  DIRECT ACCESS From Airport CDG (Charles de Gaule-Roissy)  DIRECT ACCESS From Airport  ORLY EASY & FAST ACCESS from TRAIN STATIONS METRO Station Saint Michel line 4 is 3 minutes by foot from my place RER Station  Saint Michel line B is 3 minutes by foot from my place TAXI STATION is 3 minutes by foot from my place By CAR : 2 choices of PARKING both 5 minutes by foot from my place : “Parking Saint Michel” Rue Francisque Gay n°46 and “Parking Notre Dame” Place Jean Paul II:   12  
##  Subway: Châtelet (lines 1, 4, 7, 11 & 14, RER A, B & D)                                                                                                                                                                                                                                                                                                                                                                                                                                   :   12  
##  Odéon station line 4 and 10 Saint Michel station line 4, RER B and RER C                                                                                                                                                                                                                                                                                                                                                                                                                  :   10  
##  (Other)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   :33362  
##  NA's                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      :    1  
##           host_response_time Superhost   Host_since         Listing_count    
##                    :   44     :   44   Min.   :2008-08-30   Min.   :  1.000  
##  a few days or more:  973    f:49124   1st Qu.:2013-04-27   1st Qu.:  1.000  
##  N/A               :12118    t: 2145   Median :2014-05-23   Median :  1.000  
##  within a day      : 9969              Mean   :2014-04-03   Mean   :  4.129  
##  within a few hours:13596              3rd Qu.:2015-05-27   3rd Qu.:  1.000  
##  within an hour    :14613              Max.   :2016-07-03   Max.   :155.000  
##                                        NA's   :44                            
##    Host_score     reviews_per_month number_of_reviews  square_feet     
##  Min.   : 20.00   Min.   : 0.010    Min.   :  0.00    Min.   :    0.0  
##  1st Qu.: 87.00   1st Qu.: 0.360    1st Qu.:  0.00    1st Qu.:    0.0  
##  Median : 93.00   Median : 0.900    Median :  3.00    Median :  323.0  
##  Mean   : 91.02   Mean   : 1.335    Mean   : 12.78    Mean   :  368.4  
##  3rd Qu.: 97.00   3rd Qu.: 1.860    3rd Qu.: 13.00    3rd Qu.:  538.0  
##  Max.   :100.00   Max.   :14.290    Max.   :392.00    Max.   :15059.0  
##  NA's   :14724    NA's   :13826                       NA's   :48833

Analysis

Relationship between Prices and Apartment Features

As a customer, the primary consideration when renting a place is the price. The variability in prices is inherently influenced by the type of property and room being rented. For instance, a shared room in a dormitory may have a different price range compared to a shared room in a large villa. Similarly, renting a full apartment is expected to be more expensive than renting a single room. To gain a deeper understanding of pricing dynamics in Paris, we first investigate the Airbnb dataset from this perspective. Our aim is to decipher the key factors influencing property prices and, specifically, identify the features that most significantly impact apartment prices offered by Airbnb hosts in Paris.

The Parisian offering on Airbnb is predominantly composed of entire apartments available for rent

To streamline our analysis and focus on relevant features, we reduce the size of our dataset. Assuming that most rented properties share common equipment (equipped kitchen, television, wifi or internet, sofa, etc.), we select the key attributes that customers typically consider when renting a place: the type of room or property, the number of bedrooms, beds, and bathrooms, the neighbourhood, and the guest capacity.

features_and_price <- New_data %>%
  select(Property_type,
         Room_type,
         bathrooms,
         bedrooms,
         beds,
         Neighbourhood,
         Nb_of_guests,
         Price)

View(features_and_price)

Correlation between Price and Apartment Features

# Keep only the numeric columns and the rows without missing values
cor_features_and_price <- features_and_price[, sapply(features_and_price, is.numeric)]
cor_features_and_price <- cor_features_and_price[complete.cases(cor_features_and_price), ]
# Compute and plot the Spearman (rank-based) correlation matrix
correlation_matrix <- cor(cor_features_and_price, method = "spearman")
corrplot(correlation_matrix, method = "color", main = "")

The target variable Price is positively correlated with bathrooms, beds, bedrooms, and the number of guests. We can therefore analyze the relationship between the price and some of these variables.
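
To illustrate what the Spearman method measures, here is a minimal sketch on made-up values (the vectors `price` and `beds` are assumptions, not taken from the dataset). Spearman's coefficient is the Pearson correlation computed on the ranks of the values, which makes it less sensitive to extreme prices.

```r
# Made-up values: one very expensive listing acts as an outlier
price <- c(50, 60, 80, 191, 200, 1285)
beds  <- c(1, 1, 1, 4, 3, 6)

# Spearman correlation computed directly
rho_spearman <- cor(price, beds, method = "spearman")

# The same quantity obtained as a Pearson correlation on the ranks
rho_on_ranks <- cor(rank(price), rank(beds), method = "pearson")
```

Both values coincide, and the outlier at 1285 influences the result only through its rank.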

p1<- ggplot(features_and_price) + 
  geom_histogram(aes(Price), fill = "#971a4a", alpha = 0.85, binwidth = 15) + 
  theme_minimal(base_size = 13) + 
  xlab("Price") + 
  ylab("Frequency") + 
  ggtitle("Distribution of Price")

p2 <- ggplot(features_and_price, aes(Price)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.4.0)
  geom_histogram(bins = 30, aes(y = after_stat(density)), fill = "#971a4a") + 
  geom_density(alpha = 0.2, fill = "#971a4a") + 
  ggtitle("Logarithmic distribution of Price", subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) + 
  scale_x_log10()

ggarrange(p1,
          p2,
          nrow = 1,
          ncol=2,
          labels = c("1. ", "2. "))

The logarithmic scale gives a clearer view of the price distribution. The distribution is still not Gaussian, but it is much less skewed than on the raw scale. Next, we will investigate whether prices differ between the property types and room types offered in Paris by Airbnb hosts.
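
The effect of the log transformation on skewness can be sketched with simulated data. The prices below are drawn from a log-normal distribution, which is only an assumption made for illustration, and `skewness` is a hypothetical helper computing the usual moment-based coefficient.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated right-skewed prices (assumed log-normal, median around 75)
toy_price <- rlnorm(1000, meanlog = log(75), sdlog = 0.6)

skew_raw <- skewness(toy_price)         # clearly positive: long right tail
skew_log <- skewness(log10(toy_price))  # close to zero after the transformation
```

A skewness near zero on the log scale is what makes the second histogram look more symmetric than the first.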

Price & Types of listings

Analysis by Room type

Let’s analyze more closely the total count for each distinct room type.

# Count the total number of occurrences for each distinct room type
room_type_counts <- features_and_price %>%
  group_by(Room_type) %>%
  summarize(Total_Count = n())

# Print the total count for each distinct room type
print(room_type_counts)
## # A tibble: 3 × 2
##   Room_type       Total_Count
##   <fct>                 <int>
## 1 Entire home/apt       44083
## 2 Private room           6745
## 3 Shared room             485

Now, let’s plot the distribution of room types in the dataset using a polar bar chart, making it easier to compare the relative frequencies of different room types.

room_types_counts <- table(features_and_price$Room_type)
room_types <- names(room_types_counts)
counts <- as.vector(room_types_counts)
percentages <- scales::percent(round(counts/sum(counts), 2))
room_types_percentages <- sprintf("%s (%s)", room_types, percentages)
room_types_counts_df <- data.frame(group = room_types, value = counts)

res2 <- ggplot(room_types_counts_df, aes(x = "", y = value, fill = room_types_percentages)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer("Room Types", palette = "BuPu") +
  ggtitle("Distribution of Room types") +
  theme(plot.title = element_text(color = "black", size = 12, hjust = 0.5)) +
  ylab("") +
  xlab("") +
  labs(fill="") +
  theme(axis.ticks = element_blank(), panel.grid = element_blank(), axis.text = element_blank()) +
  geom_text(aes(label = percentages), size = 5, position = position_stack(vjust = 0.5))

res2

From the plot we can see that hosts mostly offer entire homes or apartments, which make up 86% of listings, followed by private rooms (13%) and shared rooms (1%).
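
These shares can be checked directly from the counts printed earlier. The snippet below is a small sketch that reuses those three counts; the variable names are chosen for illustration only.

```r
# Listing counts per room type, as printed in the table above
room_counts <- c("Entire home/apt" = 44083, "Private room" = 6745, "Shared room" = 485)

# Share of each room type in percent, rounded to one decimal
shares <- round(100 * room_counts / sum(room_counts), 1)
shares
```

Rounding to whole percentages gives the 86%, 13% and 1% shown in the chart.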

Distribution of the price for each room type

ggplot(features_and_price) +  
  geom_boxplot(aes(x = Room_type,y = Price,fill = Room_type)) +
  labs(x = "Room Type",y = "Price",fill = "Room Type") +  
  coord_flip()

The price increases in this order: shared room < private room < entire home/apt. Let’s have a look at the average price by room type.

Average price by Room type

features_and_price %>% 
     group_by(Room_type) %>% 
     summarise(mean_price = mean(Price, na.rm = TRUE)) %>% 
     ggplot(aes(x = reorder(Room_type, mean_price), y = mean_price)) +
     # geom_col() always uses stat = "identity", so the argument is not needed
     geom_col(fill = "#56478b") +
     coord_flip() +
     theme_minimal() +
     geom_text(aes(label = round(mean_price, digits = 2)), hjust = 1.0, color = "white", size = 4.5) +
     ggtitle("Average Price by Room Type") + 
     xlab("Room Type") + 
     ylab("Average Price")

Distribution of Listings Under $1,000 by room type

features_and_price %>%
  filter(Price < 1000) %>%  # keep only listings under $1,000, as the title states
  ggplot(aes(x = Price, fill = Room_type)) +
  geom_histogram(position = "dodge") +
  scale_fill_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), name = "Room Type") +
  labs(title = "Distribution of Listings Under $1,000 by Room type", x = "Price per night", y = "Number of listings") +
  theme(plot.title=element_text(vjust=2), 
        axis.title.x=element_text(vjust=-1, face = "bold"),
        axis.title.y=element_text(vjust=4, face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This visualization shows the distribution of listing prices under $1,000 per night and the composition of room types within this price range. The majority of listings are priced between $50 and $250.

Analysis by Property type

Let’s analyze more closely the total count for each distinct property type.

# Count the total number of occurrences for each distinct property type
property_type_counts <- features_and_price %>%
  group_by(Property_type) %>%
  summarize(Total_Count = n())

# Print the total count for each distinct property type
print(property_type_counts)
## # A tibble: 19 × 2
##    Property_type     Total_Count
##    <fct>                   <int>
##  1 ""                          3
##  2 "Apartment"             49355
##  3 "Bed & Breakfast"         375
##  4 "Boat"                     29
##  5 "Cabin"                     1
##  6 "Camper/RV"                 3
##  7 "Cave"                      1
##  8 "Chalet"                    1
##  9 "Condominium"             255
## 10 "Dorm"                     26
## 11 "Earth House"               1
## 12 "House"                   508
## 13 "Igloo"                     1
## 14 "Loft"                    549
## 15 "Other"                   117
## 16 "Tipi"                      1
## 17 "Townhouse"                77
## 18 "Treehouse"                 1
## 19 "Villa"                     9

We can see that Parisian hosts offer three types of rooms: Entire home/apt, Private room and Shared room. Property types are more diverse and include some surprising entries such as cabin, cave, chalet, earth house or igloo. There is also a property type ‘Other’, into which such unexpected listings could have been grouped. Since ‘Other’ is too vague to support any conclusion, we exclude it from our analysis, along with the unlabeled type and the types with only a handful of listings (three or fewer). Consequently, we keep only the following relevant and explicit property types: Apartment, Bed & Breakfast, Boat, Condominium, Dorm, House, Loft, Townhouse, Villa.
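
The selection rule described above can also be expressed with a count threshold rather than a hand-written list. The sketch below uses a small made-up factor; the threshold `min_count` and the variable names are assumptions made for illustration.

```r
# Made-up property types, including a rare type, 'Other' and an empty label
property_type <- factor(c(rep("Apartment", 6), rep("Loft", 3), "Igloo", "Other", ""))

# Count listings per type
type_counts <- table(property_type)

# Hypothetical threshold: keep types with at least this many listings
min_count <- 2

# Frequent types, excluding the vague 'Other' and the empty label
keep <- setdiff(names(type_counts)[type_counts >= min_count], c("Other", ""))

# Filter and drop the factor levels that are no longer used
kept <- droplevels(property_type[property_type %in% keep])
table(kept)
```

Calling `droplevels()` matters for later plots, since unused levels would otherwise still appear in legends and summaries.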

list_property_types <- c("Apartment",
                         "Bed & Breakfast",
                         "Boat",
                         "Condominium",
                         "Dorm",
                         "House",
                         "Loft",
                         "Townhouse", 
                         "Villa")

features_and_price <- features_and_price %>%
  filter(Property_type %in% list_property_types)
# Count the total number of occurrences for each distinct property type
property_type_counts <- features_and_price %>%
  group_by(Property_type) %>%
  summarize(Total_Count = n())

# Print the total count for each distinct property type
print(property_type_counts)
## # A tibble: 9 × 2
##   Property_type   Total_Count
##   <fct>                 <int>
## 1 Apartment             49355
## 2 Bed & Breakfast         375
## 3 Boat                     29
## 4 Condominium             255
## 5 Dorm                     26
## 6 House                   508
## 7 Loft                    549
## 8 Townhouse                77
## 9 Villa                     9

Distribution by property

We begin by plotting the distribution of property types in the dataset using a polar bar chart, making it easier to compare the relative frequencies of different property types.

# Calculate percentages of property types
property_type_df <- features_and_price %>%
  count(Property_type) %>%
  mutate(Percentage = n / sum(n))

# Define custom colors for the pie chart
# One distinct color per property type (nine in total)
custom_colors <- c("#ffff69", "#33a02c", "#a6cee3", "#b2df8a", "#1f78b4", "#fb9a99", "#e31a1c", "#fdbf6f", "#ff7f00")

# Create the pie chart
pie_chart <- ggplot(property_type_df, aes(x = "", y = Percentage, fill = Property_type)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_manual("Property Types", values = custom_colors, labels = paste0(property_type_df$Property_type, ": ", scales::percent(property_type_df$Percentage))) +
  labs(title = "Distribution of Property Types", 
       fill = "Property Types",
       y = "Percentage") +
  theme_void() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  guides(fill = guide_legend(title = "Property Types", label.position = "right"))

# Display the pie chart
print(pie_chart)

We notice that apartments are by far the most common property type, accounting for over 96% of listings.

Price distribution by property

ggplot(features_and_price) +  
  geom_boxplot(aes(x = Property_type, y = Price, fill = Property_type)) +
  labs(x = "Property Type", y = "Price", fill = "Property Type", title = "Price Distribution by Property Type") +  
  coord_flip()

This visualization shows the distribution of prices across the property types. First, villa prices appear to be spread fairly evenly between roughly 250 and 1,250. The other distributions look broadly similar, with notable differences for Townhouse, Loft, House, and Bed & Breakfast, which exhibit higher-than-average rental prices. However, since all property types other than Apartment collectively make up only about 4% of our dataset, we have opted not to analyze them further.

Price Relationship with Accommodation Features

We define features as the following key attributes: beds, bathrooms, bedrooms, and the number of guests.

We will now explore their relationship with the price using the visualization provided below:

    a1<- ggplot(data=features_and_price) +
      geom_smooth(mapping = aes(x=Price,y=beds), method = 'gam', col='grey')
    a2<- ggplot(data=features_and_price) +
      geom_smooth(mapping = aes(x=Price,y=bedrooms), method = 'gam', col='blue')
    a3<- ggplot(data=features_and_price) +
      geom_smooth(mapping = aes(x=Price,y=bathrooms), method = 'gam', col='violet')
    a4<- ggplot(data=features_and_price) +  
      geom_smooth(mapping = aes(x=Price,y=Nb_of_guests), method = 'gam', col='black')
    
    ggarrange(a1, a2, a3, a4, nrow=2, ncol=2, align = "hv")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

Let’s now look at the four relationships together on a single plot.

# ggplot code
pfeatures <- ggplot(data = features_and_price) +
  geom_smooth(mapping = aes(x = Price, y = beds, col = 'beds'), method = 'gam') +
  geom_smooth(mapping = aes(x = Price, y = bedrooms, col = 'bedrooms'), method = 'gam') +
  geom_smooth(mapping = aes(x = Price, y = bathrooms, col = 'bathrooms'), method = 'gam') +
  geom_smooth(mapping = aes(x = Price, y = Nb_of_guests, col = 'Nb_of_guests'), method = 'gam') +
  ggtitle("Price versus features") +
  labs(y = "Features", x = "Price", colour = "Feature")

# Convert ggplot object to plotly object
pfeatures_plotly <- ggplotly(pfeatures)
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
# Print the interactive plot
pfeatures_plotly

We can see that the price tends to rise as the number of beds, bedrooms, bathrooms, and guests increases.
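
One way to put a number on this visual impression is a rank correlation between the price and each feature. The sketch below uses simulated data: the data frame `toy` and the way its columns are generated are assumptions for illustration only, with column names borrowed from `features_and_price`. Spearman's coefficient is used because it does not assume a linear relationship.

```r
set.seed(42)
n <- 500

# Hypothetical stand-in for features_and_price (simulated, not the real data)
toy <- data.frame(Nb_of_guests = sample(1:8, n, replace = TRUE))
toy$beds      <- pmax(1, toy$Nb_of_guests - sample(0:2, n, replace = TRUE))
toy$bedrooms  <- pmax(1, ceiling(toy$beds / 2))
toy$bathrooms <- pmax(1, round(toy$bedrooms / 2))
toy$Price     <- 40 + 25 * toy$Nb_of_guests + rnorm(n, sd = 20)

# Spearman rank correlation between price and each feature
features <- c("beds", "bedrooms", "bathrooms", "Nb_of_guests")
rho <- sapply(features,
              function(f) cor(toy$Price, toy[[f]], method = "spearman"))
print(round(rho, 2))
```

A coefficient close to 1 indicates that the price rises almost monotonically with the feature; values near 0 indicate no monotonic relationship.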

Analyzing the relationship between Price and number of bathrooms

# Round bathroom counts down to whole numbers (e.g. 1.5 becomes 1)
features_and_price$bathrooms <- floor(features_and_price$bathrooms)

bath_distr <- (ggplot(features_and_price,
                      aes(x = Price))
               +  geom_histogram(bins = 15, 
                                 aes(y = after_stat(density)),
                                 fill = "#66CC99")
               +  geom_density(lty = 2, color = "#fb8072")
               +  labs(title = "Distribution of prices vs Bathroom numbers",
                       x = "Price",
                       y = "Density")
               +  theme(axis.text.x = element_text(angle = 90,
                                                   hjust = 1,
                                                  vjust = 0.5),
                        axis.text.y = element_text(size = 7))
               +  facet_wrap(~ factor(bathrooms), 
                             scales = "free_y"))

bath_distr

The faceted histograms show how prices are distributed for each number of bathrooms, which gives a first impression of the relationship between the two variables.

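# Keep listings with at most 6 bathrooms to limit the influence of extreme values
apt_features_and_price_bath <- features_and_price %>%
  filter(bathrooms <= 6)

# Jittered scatter plot of price against the number of bathrooms (filtered data)
ggplot(data = apt_features_and_price_bath, aes(x = bathrooms, y = Price, color = bathrooms)) +
  geom_jitter(width = 0.1, height = 0.2, size = 0.1)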

For apartments with 0 bathrooms, the price is markedly low. The majority of rented apartments have 1, 2, or 3 bathrooms. Properties with 1 or 2 bathrooms share a similar price distribution, mostly under $500. Apartments with 3 bathrooms are spread fairly evenly between $50 and $1000. Beyond that, there is hardly any relation between the number of bathrooms and the price.
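
A small summary table makes this kind of reading easier to verify than a scatter plot alone. The sketch below computes the number of listings and the median price per bathroom count. The data frame `toy` is invented for illustration; only the column names mirror `features_and_price`.

```r
# Hypothetical toy data (not the real listings)
toy <- data.frame(
  bathrooms = c(0, 1, 1, 1, 2, 2, 3, 3),
  Price     = c(30, 80, 100, 120, 90, 150, 60, 900)
)

# Listings and median price for each bathroom count
med <- tapply(toy$Price, toy$bathrooms, median)
cnt <- table(toy$bathrooms)

summary_bath <- data.frame(
  bathrooms    = names(med),
  n            = as.vector(cnt),
  median_price = as.vector(med)
)
print(summary_bath)
```

The median is used rather than the mean because a single very expensive listing (such as the 900 in the toy data) would otherwise dominate a small group.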

Analyzing the relationship between Price and Number of Beds

beds_distr <- (ggplot(features_and_price,
                      aes(x = Price))
               +  geom_histogram(bins = 15,
                                 aes(y = after_stat(density)),
                                 fill = "#66CC99")
               +  geom_density(lty = 2,
                               color = "#fb8072")
               +  labs(title = "Distribution of prices vs Beds numbers",
                       x = "Price",
                       y = "")
               +  theme(axis.text.x = element_text(angle = 90,
                                                   hjust = 1,
                                                   vjust = 0.5),
                        axis.text.y = element_text(size = 7))
               +  facet_wrap(~ factor(beds),
                             scales = "free_y"))
beds_distr
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

beds_box <- (ggplot(features_and_price)
            +  geom_boxplot(aes(x = factor(round(beds)),
                            y = Price, 
                            fill = factor(beds)))
            +  labs(x = "# of Beds",
                    y = "Price",
                    fill = "# of Beds")
            +  coord_flip())

bed_scatt <- (ggplot(data = features_and_price, aes(x = beds, y = Price, color=beds)) +
        geom_jitter(width = 0.1,height = 0.2,size=0.1))

ggarrange(beds_box,
          bed_scatt,
          nrow = 2,
          ncol = 1,
          labels = c("A", "B"))

We can observe that people tend to reserve properties with 1 to 6 beds, and that there is no significant relationship between price and the number of beds. Apartments with a low number of beds tend to be in the same price range as those with 5 or 6 beds, probably because of other features.

Analyzing the relationship between Price and Number of Bedrooms

bedroom_distr <- (ggplot(features_and_price,
                      aes(x = Price))
               +  geom_histogram(bins = 15,
                                 aes(y = after_stat(density)),
                                 fill = "#66CC99")
               +  geom_density(lty = 2,
                               color = "#fb8072")
               +  labs(title = "Distribution of prices vs Bedrooms numbers",
                       x = "Price",
                       y = "")
               +  theme(axis.text.x = element_text(angle = 90,
                                                   hjust = 1,
                                                   vjust = 0.5),
                        axis.text.y = element_text(size = 7))
               +  facet_wrap(~ factor(bedrooms),
                             scales = "free_y"))
bedroom_distr
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

bedroom_box <- (ggplot(features_and_price)
            +  geom_boxplot(aes(x = factor(round(bedrooms)),
                            y = Price, 
                            fill = factor(bedrooms)))
            +  labs(x = "# of Bedroom",
                    y = "Price",
                    fill = "# of Bedroom")
            +  coord_flip())

bed_scatt <- (ggplot(data = features_and_price, aes(x = beds, y = Price, color=bedrooms)) +
        geom_jitter(width = 0.1,height = 0.2,size=0.1))

ggarrange(bedroom_box,
          bed_scatt,
          nrow = 2,
          ncol = 1,
          labels = c("A", "B"))

The higher the number of beds (and hence the number of guests included), the higher the price, but more beds do not necessarily mean more bedrooms or bathrooms. Listings with 2 to 3 guests, 1 bedroom, and 1 bathroom probably correspond to private or shared rooms, which are cheaper.

For listings with more than 2 bathrooms, the number of beds and bedrooms tends to reach a maximum value even though the number of guests and the price keep increasing.

Altogether, the data suggest that the number of bathrooms is not the most reliable factor for anticipating the price of an apartment on Airbnb. The number of beds and the number of guests included seem to be better indicators: prices clearly increase along with these two variables.
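
A simple way to compare candidate predictors is to fit a linear model and look at the estimated coefficients and their p-values. The sketch below is an assumption-laden illustration, not the analysis behind this report: the data are simulated so that the price depends on the number of guests only, and `toy` is a hypothetical data frame whose column names mirror `features_and_price`.

```r
set.seed(1)
n <- 400

# Simulated data: by construction, Price depends on Nb_of_guests but not on bathrooms
toy <- data.frame(
  Nb_of_guests = sample(1:8, n, replace = TRUE),
  bathrooms    = sample(1:3, n, replace = TRUE)
)
toy$Price <- 30 + 25 * toy$Nb_of_guests + rnorm(n, sd = 15)

# Fit a linear model with both predictors and inspect the coefficient table
fit <- lm(Price ~ Nb_of_guests + bathrooms, data = toy)
print(round(coef(summary(fit)), 3))
```

On the real data, the same call with `beds`, `bedrooms`, `bathrooms`, and `Nb_of_guests` as predictors would indicate which features remain informative once the others are taken into account.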

Further Analysis

Cancellation policy and host response time

price_cancellation_policy <- ggplot(data = New_data, 
  aes(x = cancellation_policy, y = Price, color=cancellation_policy)) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(color = "#971a4a", size = 12, face = "bold", hjust = 0.5))+
  coord_cartesian(ylim = c(0, 1300))

host_data_without_null_host_response_time <- subset(New_data, host_response_time != "N/A" & host_response_time != "")

price_response_time <- ggplot(data = host_data_without_null_host_response_time, 
  aes(x = host_response_time, y = Price, color = host_response_time)) + 
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(plot.title = element_text(color = "#971a4a", size = 12, face = "bold", hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 500))

ggarrange(price_response_time,
          price_cancellation_policy,
          nrow = 1,
          ncol = 2,
          labels = c("1. ", "2. "))

The first graph, which relates host response time to price, shows no discernible correlation. The second graph, however, shows a notable influence of the cancellation policy on price: price levels differ from one type of cancellation policy to another.
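
To back up a visual comparison of box plots, a Kruskal-Wallis test can check whether the price distributions differ between groups without assuming normality. The sketch below runs on simulated data: the policy names, the price shifts, and the data frame `toy` are all assumptions for illustration.

```r
set.seed(7)

# Hypothetical policies and simulated prices (not the real New_data)
toy <- data.frame(
  cancellation_policy = rep(c("flexible", "moderate", "strict"), each = 100)
)
shift <- c(flexible = 0, moderate = 10, strict = 40)
toy$Price <- 80 +
  unname(shift[as.character(toy$cancellation_policy)]) +
  rnorm(300, sd = 20)

# Median price per policy and Kruskal-Wallis rank sum test
med_price <- tapply(toy$Price, toy$cancellation_policy, median)
kw <- kruskal.test(Price ~ cancellation_policy, data = toy)

print(round(med_price, 1))
print(kw$p.value)
```

A small p-value suggests that at least one policy has a different price distribution; it does not say which one, nor that the policy causes the difference.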

Immediate reservation

ggplot(data = New_data, aes(x = instant_bookable, y = Price, color = instant_bookable)) +
       geom_boxplot(outlier.shape = NA) +coord_cartesian(ylim = c(0, 500))

Looking at how price relates to whether a listing is instantly bookable, there doesn’t seem to be a clear connection. Instantly bookable listings (represented by t) show the same price variation as listings that require the host’s acceptance (represented by f).

Analyzing availability of apartment by price

ggplot(features_and_price, aes(x = Price)) +
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +  # Adjust binwidth as needed
  labs(title = "Apartment Availability by Price",
       x = "Price",
       y = "Number of Apartments") +
  theme_minimal()

The plot above shows no clear relation between the availability of apartments and their prices. However, we can say that apartments in the $100 to $200 price range are listed far more often than the others. This is consistent with our previous analysis of the price distribution per accommodation type.

Let’s plot the price against each apartment’s availability over a year to see how availability varies with price.

ggplot( New_data, aes(availability_over_one_year, Price)) +
  geom_point(alpha = 0.2, color = "#971a4a") +
  geom_density(stat = "identity", alpha = 0.2) +
  xlab("Availability over a year") +
  ylab("Price") +
  ggtitle("Relationship between availability over a year and price") 

The plot above shows that there’s no clear relation between the availability of apartments over a year and their prices. The prices may fluctuate abruptly throughout the year and might depend upon other factors such as location and surroundings (lake view, beach side, downtown, etc.).

Let’s look at the same variable with a different plot to get a clearer picture of availability over a year.

hchart(New_data$availability_over_one_year, color = "#336666", name = "Availability") %>%
  hc_title(text = "Availability of listings") %>%
  hc_add_theme(hc_theme_ffx())

From the graph, we can deduce that many Airbnb listings are hosted between December and January, notably because of the Christmas and New Year period.

Now that our analysis of price versus apartment features is finished, let’s explore the listings by hosts and superhosts.

Number of apartments per owner

# Count the number of apartments for each distinct host ID and include the host name
apartments_per_host <- New_data %>%
  group_by(Host_id, Host_name) %>%
  summarize(Num_Apartments = n_distinct(listing_id))
## `summarise()` has grouped output by 'Host_id'. You can override using the
## `.groups` argument.
# Print the number of apartments for each distinct host ID along with the host name
print(apartments_per_host)
## # A tibble: 43,651 × 3
## # Groups:   Host_id [43,651]
##    Host_id Host_name                             Num_Apartments
##      <int> <fct>                                          <int>
##  1    2626 Franck                                             2
##  2    2883 Shayne                                             2
##  3    3631 Anne                                               1
##  4    4175 Martin                                             1
##  5    6792 Jennifer Of Cobblestone Paris Rentals             24
##  6    7749 Jules                                              1
##  7    7903 Borzou                                             1
##  8    9011 Claire                                             2
##  9    9845 Marion                                             1
## 10   12764 Dorian                                             1
## # ℹ 43,641 more rows

Top 20 ‘Number of listings’ by owners

listings_per_host <- New_data %>%
  group_by(Host_id, Host_name) %>%
  summarize(Num_Listings = n_distinct(listing_id)) %>%
  arrange(desc(Num_Listings))
## `summarise()` has grouped output by 'Host_id'. You can override using the
## `.groups` argument.

# Select the top 20 hosts with the highest number of listings
top_20_listings <- head(listings_per_host, 20)

# Print the table of the top 20 hosts with the highest number of listings
print(top_20_listings)
## # A tibble: 20 × 3
## # Groups:   Host_id [20]
##     Host_id Host_name                 Num_Listings
##       <int> <fct>                            <int>
##  1  2288803 Fabien                             154
##  2  2667370 Parisian Home                      138
##  3 12984381 Olivier                             89
##  4  3972699 Hanane                              78
##  5  3943828 Caroline                            65
##  6 21630783 Pierre                              65
##  7 39922748 Clara                               63
##  8   789620 Charlotte                           60
##  9 11593703 Rudy And Benjamin                   56
## 10   152242 Delphine                            53
## 11  3971743 Diane                               53
## 12  7612270 Paul                                53
## 13  5027164 International Home Owners           52
## 14 13013633 Benjamin                            52
## 15 67879895 Guillaume                           52
## 16 23025598 My Apartment In Paris               47
## 17  5056483 Bettina                             43
## 18  1322370 Nicolas                             42
## 19  2503671 SmartFlux                           40
## 20  2107478 Philippe                            39

Plotting the distribution of hosts by their number of listings will give us some insights. We begin by grouping the hosts into a few size classes, which makes the distribution easier to read.

count_by_host_1 <- New_data %>% 
    group_by(Host_id) %>%
    summarise(number_apt_by_host = n()) %>%
    ungroup() %>%
    mutate(groups = case_when(
        number_apt_by_host == 1 ~ "001",
        between(number_apt_by_host, 2, 50) ~ "002-050",
        number_apt_by_host > 50 ~ "051-153"))

count_by_host_2 <- count_by_host_1 %>%
    group_by(groups) %>%
    summarise(counting = n())

# Sort the count_by_host_2 data frame by the 'counting' column in descending order
count_by_host_2 <- count_by_host_2[order(-count_by_host_2$counting), ]

# Create bar chart for number of hosts per host group
bar_num_apt_by_host <- ggplot(count_by_host_2, aes(x = groups, y = counting , fill = factor(groups))) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = counting), vjust = ifelse(count_by_host_2$groups == "001", 0.0, -0.3), size = 3) +  
  labs(title = "Number of Hosts per Host Group \n ",
       x = "Host Group (number of listings)",
       y = "Number of Hosts",
       fill = "Group") +
  theme_minimal()

# Create bar chart for contrast between hosts and superhosts
bar_contrast_superhost <- ggplot(New_data) +
  geom_bar(aes(x = '', fill = Superhost)) +
  labs(title = "Contrast between Hosts and Superhosts",
       x = NULL,
       y = "Count",
       fill = "Superhost") +
  theme_minimal()

# Arrange plots in a grid
grid.arrange(bar_num_apt_by_host, bar_contrast_superhost, nrow = 2)

In this dataset, most hosts have a single listing: that is the case for 40424 owners, against only 3212 who have between 2 and 50 listings and 15 who have more than 50 listings. Superhosts are clearly a minority in this dataset.
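
The grouping of hosts into size classes can also be written with `cut()`, which assigns each count to an interval. The sketch below is a minimal illustration on a made-up vector: `listings_per_host` and the label `"051+"` are assumptions for the example, not objects from the report.

```r
# Hypothetical number of listings for eight hosts
listings_per_host <- c(1, 1, 1, 2, 7, 50, 51, 153)

# Intervals are (0, 1], (1, 50] and (50, Inf), closed on the right by default
host_group <- cut(listings_per_host,
                  breaks = c(0, 1, 50, Inf),
                  labels = c("001", "002-050", "051+"))

# Number of hosts in each group
group_counts <- table(host_group)
print(group_counts)
```

Using `Inf` as the last break avoids hard-coding the largest observed count in the group label.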

Table of owner groups by number of apartments

table_representation <- data.frame(
  Host_Group = count_by_host_2$groups,
  Number_of_Apartments = count_by_host_2$counting
 
)
table_representation
##   Host_Group Number_of_Apartments
## 1        001                40424
## 2    002-050                 3212
## 3    051-153                   15

Airbnb growth: Evolution of new hosts over time

last_date <- max(New_data$Host_since,na.rm = TRUE)
last_date
## [1] "2016-07-03"

This is the most recent date observed in the dataset: data are available up to 3 July 2016.

Number of hosts per year

new_hosts_data <- drop_na(New_data, c("Host_since"))

# Calculate the number of new hosts per month (dates from 2017 onwards are excluded as a safeguard; the data end in July 2016)
new_hosts_data$Host_since <- as.Date(new_hosts_data$Host_since, '%Y-%m-%d')
new_hosts_data <- new_hosts_data[new_hosts_data$Host_since < as.Date("2017-01-01"),]
new_hosts_data <- new_hosts_data[order(as.Date(new_hosts_data$Host_since, format="%Y-%m-%d")),]
new_hosts_data$Host_since <- format(as.Date(new_hosts_data$Host_since, "%Y-%m-%d"), format="%Y-%m")
new_hosts_data_table <- table(new_hosts_data$Host_since)

# Plot
plot(as.Date(paste(format(names(new_hosts_data_table), format="%Y-%m"),"-01", sep="")), as.vector(new_hosts_data_table), type = "l", xlab = "Time", ylab = "Number of new hosts", col = "Blue")

The dataset ends in July 2016, which limits our ability to ascertain trends in new host numbers beyond this point. From 2008 to 2015, there was a discernible increase in the number of new hosts. From 2015 until the end of the data in 2016, however, the number of new hosts declined notably.
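
The monthly series can be complemented with yearly and cumulative counts, which make the overall growth easier to read. The sketch below uses a handful of invented registration dates: `host_since` is a hypothetical vector standing in for the `Host_since` column.

```r
# Hypothetical registration dates (not the real Host_since column)
host_since <- as.Date(c("2013-05-01", "2014-02-11", "2014-09-30",
                        "2015-01-15", "2015-06-20", "2015-11-02",
                        "2016-03-08"))

# New hosts per year, then the running total of hosts
hosts_per_year   <- table(format(host_since, "%Y"))
cumulative_hosts <- cumsum(hosts_per_year)

print(hosts_per_year)
print(cumulative_hosts)
```

Note that a falling number of new hosts per year still means a growing cumulative total, so the two views answer different questions.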

Renting price per city quarter (arrondissement)

Number of listings by neighborhood

# Plot for number of listings by neighborhood
listings_neighb <- ggplot(New_data, aes(x = fct_infreq(Neighbourhood), fill = Room_type)) +
  geom_bar() +
  labs(title = "Number of Listings by Neighbourhood",
       x = "Neighbourhood", y = "Number of Listings") +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 75, hjust = 1), 
        plot.title = element_text(color = "black", size = 12,  hjust = 0.5))

# Plot the bar chart
listings_neighb

Average price per Neighbourhood

library(ggplot2)

# Calculate average daily price per city quarter
average_prices_per_arrond <- aggregate(cbind(New_data$Price),
                                       by = list(arrond = New_data$city_quarter),
                                       FUN = function(x) mean(x))

# Plot for average daily price per city quarter
price_arrond <- ggplot(data = average_prices_per_arrond, aes(x = arrond, y = V1)) +
  geom_bar(stat = "identity", fill = "lightblue", width = 0.7) +
  geom_text(aes(label = round(V1, 2)), size = 4) +
  coord_flip() +
  labs(title = "Average Daily Price per City Quarter",
       x = "City Quarters", y = "Average Daily Price") +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1), 
        plot.title = element_text(color = "black", size = 12,  hjust = 0.5))

# Display the plot
print(price_arrond)

The most expensive districts are the 1st to the 8th and the 16th. Their average price ranges from around 100 to 159 dollars, probably because most of the monuments and tourist areas are either inside or near these districts.

Other districts have a mean price between 66 and 88 dollars. Most of the listings are located in these districts.
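
The ranking of districts by average price can be reproduced with a grouped mean followed by a sort. The sketch below works on a tiny invented data frame: `toy`, its prices, and the unaccented spelling of the quarter names are assumptions for illustration.

```r
# Hypothetical listings (not the real New_data)
toy <- data.frame(
  city_quarter = c("Louvre", "Louvre", "Elysee", "Elysee",
                   "Popincourt", "Popincourt"),
  Price        = c(120, 140, 150, 170, 70, 90)
)

# Mean price per quarter, ordered from most to least expensive
avg_price <- tapply(toy$Price, toy$city_quarter, mean)
avg_price <- avg_price[order(avg_price, decreasing = TRUE)]
print(avg_price)
```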

New_data %>%
  group_by(Neighbourhood) %>%
  dplyr::summarize(num_listings = n(), borough = unique(Neighbourhood)) %>%
  top_n(n = 10, wt = num_listings) %>%
  ggplot(aes(x = fct_reorder(Neighbourhood, num_listings), y = num_listings, fill = borough)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 neighborhoods by nb. of listings", x = "Neighbourhood", y = "Nb. of listings")

Visit frequency of the different quarters according to time.

table <- inner_join(New_data, R,by = "listing_id")
tab1 <- select(New_data,listing_id,city,city_quarter)
table = mutate(table,year = as.numeric(str_extract(table$date, "^\\d{4}")))

     
    p6 <- ggplot(table) +
      geom_bar(aes(y =city_quarter ,fill=factor(year)))+
      scale_size_area() +
      labs( x="Frequency", y="City quarter",fill="Year")+
      scale_fill_brewer(palette ="Spectral")
    
    ggplotly(p6)

The graph shows that the highest number of listings was recorded in 2015. If data for 2016 were available beyond July, we would probably see comparable figures for 2015 and 2016. We also observe that the listings increase each year since the inception of Airbnb, as the platform gained popularity worldwide.
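
The counts behind the stacked bars can be obtained as a cross-tabulation of quarter by year. The sketch below is a minimal illustration on invented records: `reviews` is a hypothetical data frame with one row per dated record, and the year is extracted from the first four characters of the date string.

```r
# Hypothetical dated records (not the real joined table)
reviews <- data.frame(
  city_quarter = c("Louvre", "Louvre", "Popincourt",
                   "Popincourt", "Popincourt"),
  date         = c("2014-07-01", "2015-03-12", "2015-05-30",
                   "2015-12-24", "2016-02-02")
)

# Extract the year and cross-tabulate quarter by year
reviews$year <- as.numeric(substr(reviews$date, 1, 4))
visits <- table(reviews$city_quarter, reviews$year)
print(visits)

# Total per year across all quarters
print(colSums(visits))
```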

Number of rented Apartments in Paris neighbourhood over years

Evolution of apartments over years

#Convert Date type from factor to date
table["date"] <- table["date"] %>% map(., as.Date)


# Aggregate by date and neighbourhood and count the rows
# to get the number of rentals per neighbourhood and date
longitudinal  <- table %>%
  group_by(date, Neighbourhood) %>%
  summarise(count_obs = n())
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
time_location <- (ggplot(longitudinal,
                         aes(x = date,
                             y = count_obs,
                             group = 1))
                  +  geom_line(linewidth = 0.5,
                               colour = "lightblue")
                  +  stat_smooth(color = "darkblue",
                                  method = "loess")
                  +  scale_x_date(date_labels = "%Y")
                  +  labs(x = "Year",
                          y = "No. of Rented Apartments")
                  +  facet_wrap(~ Neighbourhood))
time_location
## `geom_smooth()` using formula = 'y ~ x'

The evolution over the years shows a similar pattern for all neighbourhoods: the number of listings has grown everywhere, most notably in Buttes-Montmartre and Popincourt.

Price range within Paris neighborhoods

# Filter data for Paris
paris_data <- New_data %>%
  filter(city == "Paris" & !is.na(longitude) & !is.na(latitude) & longitude != "" & latitude != "")

# Calculate average price for each neighborhood
avg_price_per_neighborhood <- paris_data %>%
  group_by(Neighbourhood) %>%
  summarize(Avg_Price = mean(Price))

# Create the violin plot
violin_plot <- ggplot(paris_data, aes(x = Neighbourhood, y = Price, fill = Neighbourhood)) +
  geom_violin() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Price Range within Paris Neighborhoods", x = "Neighbourhood", y = "Price") +
  scale_fill_manual(values = rainbow(length(unique(paris_data$Neighbourhood))), 
                    guide = guide_legend(title = "Neighbourhood (Avg. Price)"),
                    breaks = avg_price_per_neighborhood$Neighbourhood,
                    labels = paste(avg_price_per_neighborhood$Neighbourhood, " (", round(avg_price_per_neighborhood$Avg_Price, 2), ")"))

# Print the violin plot
violin_plot

We can see that the price is higher around the center of Paris.

From the above plot, it is evident that certain districts, such as Elysée, Opera, and Palais-Bourbon, exhibit a higher concentration of properties. This observation aligns with the understanding that real estate prices tend to be notably higher in these districts compared to others.

Location of listings

Neighborhood listings map:

This is an interactive map using Leaflet displaying the listings by neighborhood.

df <- select(L,longitude,neighbourhood,latitude,price)

leaflet(df %>% select(longitude,neighbourhood,
                      latitude,price))%>%
  setView(lng = 2.3488, lat = 48.8534 ,zoom = 10) %>%
   addTiles() %>% 
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addMiniMap()
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Superhosts listings map :

This is an interactive map using Leaflet displaying the listings owned by ‘Superhosts’ (a total of 2145 meaning around 4% of the total listings).

# Keep only Superhost listings, then the columns needed for the map
dfsuperhost <- New_data %>%
  filter(Superhost == "t") %>%
  select(longitude, Neighbourhood, latitude, Price)

leaflet(dfsuperhost) %>%
  setView(lng = 2.3488, lat = 48.8534, zoom = 10) %>%
  addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  addMiniMap()
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
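
The number and share of Superhost listings quoted above can be computed directly from the flag column. The sketch below uses a made-up vector of ten flags: `superhost` is hypothetical and stands in for `New_data$Superhost`, which is assumed to be coded as `"t"` and `"f"` as in the filter above.

```r
# Hypothetical Superhost flags for ten listings
superhost <- c("t", "f", "f", "f", "t", "f", "f", "f", "f", "f")

# Count the Superhost listings and express them as a percentage of all listings
n_superhosts <- sum(superhost == "t")
share_pct    <- 100 * mean(superhost == "t")

print(n_superhosts)
print(share_pct)
```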

Summary: Insights into Airbnb Listings in Paris

Property Type and Price

The predominant type of Airbnb listing in Paris is the entire home or apartment. Pricing of these listings is influenced by factors such as the number of beds, bedrooms, bathrooms, and capacity to accommodate guests. The type of listing (entire home or shared space) also plays a significant role in determining price.

Location and Pricing

There is a correlation between listing price and location. Neighborhoods with better amenities and higher desirability tend to have fewer Airbnb listings, but these listings command higher prices. Districts like Buttes-Montmartre, Popincourt, and Vaugirard are popular areas, while renowned Parisian quarters like Elysée, Palais-Bourbon, Louvre, and Luxembourg exhibit higher listing prices due to historical significance and tourist appeal.

Superhost Status

Only a minority of hosts achieve Superhost status on Airbnb. Superhosts are recognized for providing exceptional guest experiences, as evaluated by guest reviews and other criteria. The stringent evaluation process ensures that Superhosts maintain high standards of hospitality, enhancing trust and satisfaction among guests.

Conclusion

Our analysis underscores the intricate interplay between property attributes, location dynamics, and host reputation in shaping the Airbnb landscape in Paris. These insights provide valuable guidance for both hosts and guests navigating the vibrant short-term rental market in the city.

sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3      zoo_1.8-12         here_1.0.1         kableExtra_1.4.0  
##  [5] highcharter_0.9.4  corrplot_0.92      leaflet_2.2.2      plotly_4.10.4     
##  [9] writexl_1.5.0      ggpubr_0.6.0       ggmap_4.0.0        lubridate_1.9.3   
## [13] forcats_1.0.0      purrr_1.0.2        readr_2.1.4        tibble_3.2.1      
## [17] tidyverse_2.0.0    ggplot2_3.4.4      stringr_1.5.0      dplyr_1.1.3       
## [21] shiny_1.8.1        tidyr_1.3.0        skimr_2.1.5        DataExplorer_0.8.3
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7       rlang_1.1.1        magrittr_2.0.3     compiler_4.3.3    
##  [5] mgcv_1.9-1         png_0.1-8          systemfonts_1.0.5  vctrs_0.6.4       
##  [9] pkgconfig_2.0.3    crayon_1.5.2       fastmap_1.1.1      backports_1.4.1   
## [13] labeling_0.4.3     utf8_1.2.4         promises_1.2.1     rmarkdown_2.25    
## [17] tzdb_0.4.0         xfun_0.40          cachem_1.0.8       jsonlite_1.8.7    
## [21] later_1.3.2        jpeg_0.1-10        broom_1.0.5        parallel_4.3.3    
## [25] R6_2.5.1           bslib_0.5.1        stringi_1.7.12     RColorBrewer_1.1-3
## [29] rlist_0.4.6.2      car_3.1-2          jquerylib_0.1.4    Rcpp_1.0.11       
## [33] assertthat_0.2.1   knitr_1.44         base64enc_0.1-3    Matrix_1.6-5      
## [37] httpuv_1.6.15      splines_4.3.3      igraph_2.0.3       timechange_0.2.0  
## [41] tidyselect_1.2.0   rstudioapi_0.15.0  abind_1.4-5        yaml_2.3.7        
## [45] curl_5.1.0         lattice_0.22-5     plyr_1.8.9         quantmod_0.4.26   
## [49] withr_2.5.1        evaluate_0.22      xts_0.13.2         xml2_1.3.5        
## [53] pillar_1.9.0       carData_3.0-5      generics_0.1.3     TTR_0.24.4        
## [57] rprojroot_2.0.4    hms_1.1.3          munsell_0.5.0      scales_1.2.1      
## [61] xtable_1.8-4       glue_1.6.2         lazyeval_0.2.2     tools_4.3.3       
## [65] data.table_1.14.8  ggsignif_0.6.4     cowplot_1.1.3      grid_4.3.3        
## [69] crosstalk_1.2.1    colorspace_2.1-0   nlme_3.1-164       networkD3_0.4     
## [73] repr_1.1.7         cli_3.6.1          fansi_1.0.5        viridisLite_0.4.2 
## [77] svglite_2.1.3      gtable_0.3.4       rstatix_0.7.2      sass_0.4.7        
## [81] digest_0.6.33      htmlwidgets_1.6.4  farver_2.1.1       htmltools_0.5.8.1 
## [85] lifecycle_1.0.3    httr_1.4.7         mime_0.12